Maximum entropy models
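As a concrete reference point for the notes on regularization and optimization, a conditional maximum entropy model can be written as a log-linear classifier, p(y | x) proportional to exp(w_y · f(x)), and fitted by gradient ascent on a penalized log-likelihood. The following is a minimal sketch under stated assumptions, not the implementation of any paper cited on this page: the function names `predict_proba` and `train_maxent`, the toy data, the learning rate and the penalty strength are all made up for illustration.

```python
import math


def predict_proba(weights, features):
    """Return p(y | x) for every class under a log-linear (maxent) model."""
    scores = [sum(w * f for w, f in zip(w_y, features)) for w_y in weights]
    top = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)  # normalization term: a sum over all classes
    return [e / z for e in exps]


def train_maxent(data, n_classes, n_features, l2=0.1, lr=0.5, epochs=200):
    """Batch gradient ascent on the l2-penalized conditional log-likelihood.

    All hyperparameter defaults are illustrative assumptions.
    """
    weights = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * n_features for _ in range(n_classes)]
        for features, label in data:
            probs = predict_proba(weights, features)
            for y in range(n_classes):
                # Gradient = observed feature count minus expected count.
                indicator = 1.0 if y == label else 0.0
                for j, f in enumerate(features):
                    grad[y][j] += (indicator - probs[y]) * f
        for y in range(n_classes):
            for j in range(n_features):
                # The l2 term (Gaussian prior) pulls each weight toward zero.
                step = grad[y][j] / len(data) - l2 * weights[y][j]
                weights[y][j] += lr * step
    return weights


# Toy data (hypothetical): (feature vector, label); the last feature is a bias.
DATA = [
    ([1.0, 0.0, 1.0], 0),
    ([0.9, 0.1, 1.0], 0),
    ([0.0, 1.0, 1.0], 1),
    ([0.1, 0.9, 1.0], 1),
]
WEIGHTS = train_maxent(DATA, n_classes=2, n_features=3)
```

Calling `predict_proba(WEIGHTS, [1.0, 0.0, 1.0])` returns a distribution over the two classes. With only two classes the normalization term is trivial to compute; the difficulty discussed under the "big vocabulary" cases arises when the sum runs over a very large or infinite set of outcomes.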
Regularization

"The maximum-entropy (ME) principle, which prescribes choosing the model that maximizes the entropy out of all models that satisfy given feature constraints, can be seen as a built-in regularization mechanism that avoids overfitting the training data." (Riezler and Vasserman, 2004)

"For ME models, the use of an l2 regularizer, corresponding to imposing a Gaussian prior on the parameter values, has been proposed by Johnson et al. (1999) and Chen and Rosenfeld (1999). Feature selection for ME models has commonly used simple frequency-based cut-off, or likelihood-based feature induction as introduced by Della Pietra et al. (1997)."

"Tibshirani (1996) proposed a technique based on l1 regularization that embeds feature selection into regularization such that both a precise assessment of the reliability of features and the decision about inclusion or deletion of features can be done in the same framework."

"a combined incremental feature selection and regularization method can be established for maximum entropy modeling by a natural incorporation of the regularizer into gradient-based feature selection, following Perkins et al. (2003)."

Optimization

A survey: Malouf (2002), comparing "Generalized Iterative Scaling and Improved Iterative Scaling, as well as general purposed optimization techniques such as gradient ascent, conjugate gradient, and variable metric methods".

Big "vocabulary" cases

Johnson et al. (1999) dealt with the probability of syntactic structures. Computing the normalization term is difficult because it entails summing over the infinite set of possible syntactic structures. Some solutions have been proposed: Monte-Carlo estimation (Abney, 1997), which is not efficient, and pseudo-likelihood, i.e. estimating the normalization term (Johnson et al., 1999). I call this problem "big vocabulary" because of its resemblance to the problem in language modeling.

Applications

* Language model: Chen and Rosenfeld (1999)
* Sentence boundary detection: Ratnaparkhi (1998)
* POS tagging: Ratnaparkhi (1998)
* Syntactic parsing: Ratnaparkhi (1998)
* Unsupervised prepositional phrase attachment: Ratnaparkhi (1998)
* Recommender system: Jin et al. (2005)
* Named-entity recognition: Borthwick (1999), Curran and Clark (2003)
* Semantic role labeling:
** PropBank-style: Che et al. (2009) ("During the SRC stage, a Maximum entropy (Berger et al., 1996) classifier is used to predict the probabilities of a word in the sentence")
** NomBank: Jiang and Ng (2006)
** FrameNet-style: Fleischman et al. (2003), Das et al. (2010)
* Coreference resolution: Culotta et al. (2006)

References

* Abney, Steven P. "Stochastic Attribute-Value Grammars." Computational Linguistics 23.4 (1997): 597–617.
* Borthwick, Andrew. A maximum entropy approach to named entity recognition. Diss. New York University, 1999.
* Che, W., Li, Z., Li, Y., Guo, Y., Qin, B., & Liu, T. (2009, June). Multilingual dependency-based syntactic and semantic parsing. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task (pp. 49-54). Association for Computational Linguistics.
* Chen, Stanley F., and Ronald Rosenfeld. "Efficient sampling and feature selection in whole sentence maximum entropy language models." Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference on. Vol. 1. IEEE, 1999.
* Culotta, Aron, et al. "First-order probabilistic models for coreference resolution." (2006).
* Curran, James R., and Stephen Clark. "Language independent NER using a maximum entropy tagger." Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4. Association for Computational Linguistics, 2003.
* Das, D., Schneider, N., Chen, D., & Smith, N. A. (2010). Probabilistic frame-semantic parsing. HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 948–956. Retrieved from http://dl.acm.org/citation.cfm?id=1858136
* Fleischman, M., N. Kwon, and E. Hovy. 2003. Maximum entropy models for FrameNet classification. In Proc. of EMNLP.
* Jiang, Zheng Ping, and Hwee Tou Ng. "Semantic role labeling of NomBank: A maximum entropy approach." Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2006.
* Jin, Xin, Yanzan Zhou, and Bamshad Mobasher. "A maximum entropy web recommendation system: combining collaborative and content features." Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM, 2005.
* Johnson, Mark, et al. "Estimators for stochastic unification-based grammars." Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics. Association for Computational Linguistics, 1999.
* Malouf, Robert. "A comparison of algorithms for maximum entropy parameter estimation." Proceedings of the 6th conference on Natural language learning-Volume 20. Association for Computational Linguistics, 2002.
* Ratnaparkhi, Adwait. Maximum entropy models for natural language ambiguity resolution. Diss. University of Pennsylvania, 1998.
* Riezler, Stefan, and Alexander Vasserman. "Incremental Feature Selection and l1 Regularization for Relaxed Maximum-Entropy Modeling." EMNLP. 2004.

External links

* Seminal paper introducing the maximum entropy approach to NLP: Berger, Adam L., Vincent J. Della Pietra, and Stephen A. Della Pietra. "A maximum entropy approach to natural language processing." Computational linguistics 22.1 (1996): 39-71.
* Vasilis Vryniotis. Machine Learning Tutorial: The Max Entropy Text Classifier
* Adwait Ratnaparkhi. A Simple Introduction to Maximum Entropy Models for Natural Language Processing

Category:Machine learning